Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
TPT does not explicitly align the pre-trained CLIP model to the distribution of the test samples. For effective test-time adaptation of V-L foundation models, it is crucial to bridge the distribution gap between the pre-training dataset and the downstream evaluation set in order to achieve high zero-shot generalization.
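The distribution-alignment idea above can be sketched as a simple objective: pull the feature statistics of the current test sample (e.g., over its augmented views) toward statistics precomputed on the source/pre-training data. This is a minimal toy sketch under our own assumptions (per-dimension mean/variance, L1 distance); it is not the paper's exact formulation, and all numbers below are hypothetical.

```python
# Hedged sketch of a test-time distribution-alignment loss: match the test
# sample's per-dimension feature mean/variance to source-dataset statistics.
def alignment_loss(test_feats, source_mean, source_var):
    n, d = len(test_feats), len(source_mean)
    mu = [sum(f[j] for f in test_feats) / n for j in range(d)]
    var = [sum((f[j] - mu[j]) ** 2 for f in test_feats) / n for j in range(d)]
    return (sum(abs(mu[j] - source_mean[j]) for j in range(d))
            + sum(abs(var[j] - source_var[j]) for j in range(d))) / d

# Toy 2-D "features" of two augmented views of one test image (made-up values).
views = [[0.1, 0.2], [0.3, 0.4]]
loss = alignment_loss(views, source_mean=[0.2, 0.3], source_var=[0.01, 0.01])
print(round(loss, 6))  # statistics already match -> 0.0
```

Minimizing such a loss with respect to the prompt parameters (while keeping the backbone frozen) is one way to make the model "aware" of the test distribution without any labels.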
In Figure 1 we show the common neighbor (CN) distribution among positive and negative test samples for ogbl-collab, ogbl-ppa, and ogbl-citation2. These results demonstrate that the vast majority of negative samples have no CNs. Since the CN count is typically a good heuristic, this makes it easy to identify most negative samples. We further present the CN distribution of Cora, Citeseer, Pubmed, and ogbl-ddi in Figure 3. The CN distributions of Cora, Citeseer, and Pubmed are consistent with our previous observations on the OGB datasets in Figure 1. We note that ogbl-ddi exhibits a different distribution from the other datasets: most of its negative samples do have common neighbors. This is likely because ogbl-ddi is considerably denser than the other graphs.
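The CN heuristic discussed above is straightforward to compute: count the neighbors two nodes share. A minimal sketch on a toy graph (our own example, not one of the datasets above):

```python
from collections import defaultdict

# Toy undirected graph as an adjacency map (hypothetical example).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def common_neighbors(adj, u, v):
    """Number of common neighbors of u and v (the CN heuristic)."""
    return len(adj[u] & adj[v])

print(common_neighbors(adj, 1, 2))  # linked pair shares nodes 0 and 3 -> 2
print(common_neighbors(adj, 0, 4))  # random non-edge shares nothing   -> 0
```

In sparse graphs, randomly sampled negative pairs behave like the second call (zero CNs), which is exactly why CN separates positives from negatives so easily there, and why the much denser ogbl-ddi breaks the pattern.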
Efficient Lifelong Model Evaluation in an Era of Rapid Progress
Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling \textit{ever-expanding} large-scale benchmarks called \textit{Lifelong Benchmarks}. As exemplars of our approach, we create \textit{Lifelong-CIFAR10} and \textit{Lifelong-ImageNet}, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this challenge, we also introduce an efficient evaluation framework: \textit{Sort \& Search (S\&S)}, which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples, enabling cost-effective lifelong benchmarking. Extensive empirical evaluations across $\sim$31,000 models demonstrate that \textit{S\&S} achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours ($\sim$1000x reduction) on a single A100 GPU, with low approximation error. As such, lifelong benchmarks offer a robust, practical solution to the ``benchmark exhaustion'' problem.
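The rank-then-sub-select idea can be illustrated with a toy sketch: sort samples by how many previously evaluated models got them right, so a new model's correctness looks roughly like a step function over that order, then probe only a few positions and fit a single threshold. All names, numbers, and the thresholding rule below are our own simplification, not the paper's actual S\&S algorithm.

```python
def sort_samples(history):
    """history[m][i] == 1 if past model m answered sample i correctly.
    Returns sample indices sorted hardest-first."""
    n = len(history[0])
    difficulty = [sum(row[i] for row in history) for i in range(n)]
    return sorted(range(n), key=lambda i: difficulty[i])

def estimate_accuracy(correct_on_subset, subset_positions, n):
    """Fit one threshold t over the sorted order: predict 'wrong' before t,
    'right' from t on; estimated accuracy = fraction of samples beyond t."""
    best_t, best_err = 0, float("inf")
    for t in [0] + [p + 1 for p in subset_positions]:
        err = sum((p >= t) != bool(c)
                  for p, c in zip(subset_positions, correct_on_subset))
        if err < best_err:
            best_t, best_err = t, err
    return (n - best_t) / n

# Toy history: 4 past models on 10 samples; easier samples are right more often.
history = [
    [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
]
order = sort_samples(history)
n = len(order)

# Hypothetical new model: correct on exactly the 6 easiest samples (true acc 0.6).
new_model = lambda i: 1 if order.index(i) >= 4 else 0
probe_positions = [1, 3, 5, 7, 9]              # evaluate 5 of 10 positions
correct = [new_model(order[p]) for p in probe_positions]
est = estimate_accuracy(correct, probe_positions, n)
print(est)  # -> 0.6
```

The real framework handles noisy, non-monotone correctness patterns and millions of samples, but the same principle applies: past evaluations induce an ordering that makes new evaluations cheap.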
BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
Adaptation of pretrained vision-language models such as CLIP to various downstream tasks has attracted great interest in recent research. Previous works have proposed a variety of test-time adaptation (TTA) methods to achieve strong generalization without any knowledge of the target domain. However, existing training-required TTA approaches like TPT rely on entropy minimization, which incurs large computational overhead, while training-free methods like TDA overlook the potential for mining information from the test samples themselves. In this paper, we break down the design of existing popular training-required and training-free TTA methods and bridge the gap between them within our framework. Specifically, we maintain a lightweight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples. The historical samples are filtered from the test data stream and serve to extract useful information from the target distribution, while the boosting samples are drawn from regional bootstrapping and capture knowledge of the test sample itself. We theoretically justify the rationale behind our method and empirically verify its effectiveness on both out-of-distribution and cross-domain datasets, showcasing its applicability in real-world situations.
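The key-value memory described above can be sketched as a small cache that admits only confident (low-entropy) test features under their pseudo-labels and scores new features by similarity to cached entries. This is a minimal sketch in the spirit of such training-free caches; the class name, capacity rule, and scoring are our assumptions, not the paper's exact method.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector (lower = more confident)."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

class KVMemory:
    """Tiny key-value cache: per pseudo-class, keep the most confident features."""

    def __init__(self, capacity_per_class=3):
        self.capacity = capacity_per_class
        self.store = {}  # pseudo-label -> list of (entropy, feature)

    def maybe_add(self, feature, probs):
        """Cache a test feature under its argmax pseudo-label, evicting
        the least confident entries once the per-class capacity is hit."""
        label = max(range(len(probs)), key=lambda c: probs[c])
        bucket = self.store.setdefault(label, [])
        bucket.append((entropy(probs), feature))
        bucket.sort(key=lambda e: e[0])        # most confident first
        del bucket[self.capacity:]

    def retrieve_logits(self, feature, num_classes):
        """Class scores from mean similarity to cached features
        (dot product; assumes unit-normalized features)."""
        logits = [0.0] * num_classes
        for label, bucket in self.store.items():
            sims = [sum(a * b for a, b in zip(feature, f)) for _, f in bucket]
            logits[label] += sum(sims) / len(sims)
        return logits

mem = KVMemory()
mem.maybe_add([1.0, 0.0], [0.9, 0.1])   # confident sample cached under class 0
mem.maybe_add([0.0, 1.0], [0.2, 0.8])   # confident sample cached under class 1
print(mem.retrieve_logits([1.0, 0.0], num_classes=2))  # class 0 scores higher
```

In a full method, these cache logits would be combined with the frozen model's zero-shot logits; the cache is what lets information flow from earlier test samples (and, for boosting samples, from augmented regions of the current sample) into the prediction.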
Actively Testing Your Model While It Learns: Realizing Label-Efficient Learning in Practice
In active learning (AL), we focus on reducing the data annotation cost from the model training perspective. However, ``testing'', which often refers to the model evaluation process of using empirical risk to estimate the intractable true generalization risk, also requires data annotations. The annotation cost of ``testing'' (model evaluation) is under-explored. Even in works that study active model evaluation or active testing (AT), the learning and testing ends are disconnected. In this paper, we propose a novel active testing while learning (ATL) framework that integrates active learning with active testing.